The CBE Hardware Accelerator for Numerical Relativity: A Simple Approach 
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Hardware accelerators (such as the Cell Broadband Engine) have recently received a significant 
amount of attention from the computational science community because they can provide signifi- 
cant gains in the overall performance of many numerical simulations at a low cost. However, such 
accelerators usually employ a rather unfamiliar and specialized programming model that often re- 
quires advanced knowledge of their hardware design. In this article, we demonstrate an alternate 
and simpler approach towards managing the main complexities in the programming of the Cell 
processor, called software caching. We apply this technique to a numerical relativity application: 
a time-domain, finite-difference Kerr black hole perturbation evolver, and present the performance 
results. We obtain gains in the overall performance of generic simulations that are close to the 
theoretical maximum that can be obtained through our parallelization approach. 



I. INTRODUCTION 

Computer simulations are playing an increasingly im- 
portant role in nearly every area of science and engineer- 
ing today The main drive behind this trend is the rapid 
increase in the overall performance of computer hard- 
ware over the past several decades {Moore 's Law) and its 
relatively low cost. 

It is interesting to note that the overall performance of 
certain specific computing technologies, such as graphics 
cards, gaming consoles, etc. has continued to increase 
at a rate much higher than that of traditional work- 
station processors, thus making scientific computing on 
such devices an intriguing possibility: Compute Unified 
Device Architecture (CUDA) Q is NVIDIA's general- 
purpose software development system for graphics pro- 
cessing units (GPUs); Cell Broadband Engine (CBE) [2] 
is a processor that was designed by a collaboration be- 
tween Sony, Toshiba, and IBM and is being used in 
gaming consoles (Sony's Playstation [3[) as well as high- 
performance computing hardware (IBM's Cell blades [J], 
LANL RoadRunner d). 

In this article, we will focus entirely on the CBE and 
demonstrate its use in the acceleration of scientific com- 
puting applications. In particular, we take a sample 
application from the numerical relativity (NR) commu- 
nity - a Teukolsky equation solver Q which is a finite- 
difference (2+l)D linear, hyperbolic, homogeneous par- 
tial differential equation (PDE) solver - and implement 
the low-level parallelism that is offered by the CBE archi- 
tecture. We describe the approach taken and its outcome 
in detail. This NR application is fairly generic, i.e. it is 
of a type that is quite common in various fields of science 
and engineering. Therefore our work would be of interest 
to a larger community of computational scientists. 

It is worth pointing out that various researchers have 
performed a similar evaluation of the CBE using other 
finite-difference PDE solvers Q with very promising re- 
sults. These studies typically involved extensive coding 
at a level that requires rather specialized and advanced 



knowledge of the CBE's design. Indeed, such an in- 
vestment is very necessary if the goal is to extract the 
maximum amount of performance that the processor can 
offer. However, our approach in this article is one in 
which we will make a considerably smaller investment 
in the specialized coding and yet obtain strong gains in 
overall application performance. Our approach makes 
use of a software caching mechanism on the Cell's Syner- 
gistic Processing Elements (SPEs) that was made avail- 
able recently in IBM's Cell SDK 0. We describe this 
less-known mechanism and our implementation in detail, 
later in this article. 



II. NUMERICAL RELATIVITY 

Several gravitational wave observatories Q are cur- 
rently being built all over the world: LIGO in the United 
States, GEO/Virgo in Europe and TAMA in Japan. 
These observatories will open a new window onto the 
Universe by enabling scientists to make astronomical ob- 
servations using a completely new medium - gravita- 
tional waves (GW) , as opposed to electromagnetic waves 
(light). These waves were predicted by Einstein's relativ- 
ity theory, but have not been directly observed because 
the required experimental sensitivity was simply not ad- 
vanced enough, until very recently. 

Numerical relativity is an area of computational sci- 
ence that emphasizes the detailed modeling of strong 
sources of GWs - collisions of compact astrophysical ob- 
jects, such as neutron stars and black holes. Thus, it 
plays an extremely important role in the area of GW 
astronomy and gravitational physics, in general. More- 
over, the NR community has also contributed to the 
broader computational science community by developing 
an open-source, modular, parallel computing infrastruc- 
ture called Cactus 

The specific NR application we have chosen for consid- 
eration in this work is one that evolves the perturbations 
of a rotating (Kerr) black hole, i.e. solves the Teukol- 



2 



sky equation Q in the time-domain. This equation is 
essentially a linear wave-equation in Kerr space-time ge- 
ometry. The next two subsections provide more detailed 
information on this equation and the associated numeri- 
cal solver code. 



A. Teukolsky Equation 

The Teukolsky master equation describes scalar, vec- 
tor and tensor field perturbations in the space-time of 
Kerr black holes In Boyer-Lindquist coordinates, 

this equation takes the form 
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where M is the mass of the black hole, a its angular mo- 
mentum per unit mass, A = r 2 — 2Mr + a 2 and s is the 
"spin weight" of the field. The s = ±2 versions of these 
equations describe the radiative degrees of freedom of the 
gravitational field, and thus are the equations of interest 
here. As mentioned previously, this equation is an ex- 
ample of linear, hyperbolic, homogeneous PDEs which 
are quite common in several areas of science and engi- 
neering, and can be solved numerically using a variety of 
finite-difference schemes. 



B. Teukolsky Code 

Ref. pjl demonstrated stable numerical evolution of 
Eq. |T]) for s = — 2 using the well-known Lax-Wendroff 
numerical evolution scheme. Our Teukolsky code uses 
the exact same approach, therefore the contents of this 
section are largely a review of the work presented in the 
relevant literature [Hj]. 

Our code uses the tortoise coordinate r* in the radial 
direction and azimuthal coordinate <f>. These coordinates 
are related to the usual Boyer-Lindquist coordinates by 



dr* = 
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Following Ref. [l2j, we factor out the azimuthal depen- 
dence and use the ansatz, 



V(t,r*,0,j>) = e mi< V$(i,r*,6>). 
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allows the Teukolsky equation to be rewritten as 
dtu + Md n u + Lu + Au = 0, 
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is the solution vector. The subscripts R and I refer to 
the real and imaginary parts respectively (note that the 
Teukolsky function ^ is a complex valued quantity) . Ex- 
plicit forms for the matrices M, A and L can be easily 
found in the relevant literature [12|. Rewriting Eq. ((SJ) 
as 



d t u + Dd r *u = S 



where 



D 



(b 
6 
0-60 

Vo -bj 



S = — (M - D)d r ,u Lu Au, 



(10) 



(11) 



(12) 



and using the Lax-Wendroff iterative scheme, we obtain 
stable evolutions. Each iteration consists of two steps: 
In the first step, the solution vector between grid points 
is obtained from 
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This is used to compute the solution vector at the next 
time step, 
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The angular subscripts are dropped in the above equation 
for clarity. All angular derivatives are computed using 
second-order, centered finite difference expressions. 

Following Ref. [13], we set <I> and LT to zero on the 
inner and outer radial boundaries. Symmetries of the 
spheroidal harmonics are used to determine the angu- 
lar boundary conditions: For even |m| modes, we have 
dg$ = at 6 = 0, 7r while $ = at 9 = 0, ir for modes of 
odd I ml. 
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As a sample numerical result, we take initial data cor- 
responding to a narrow Gaussian pulse. This data per- 
turbs the black hole, causing it to ring down according to 
its characteristic quasi- normal frequencies. Fig. [1] shows 
the results, illustrating the quasi-normal ringing for the 
I = 0, m = mode of a black hole with spin parameter 
a = 0.9. 
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FIG. 1: The quasi-normal ringing for a black hole with a/M = 
0.9; the I = 0, m = mode is shown here. The evolution of 
the real part of the Teukolsky function, extracted at r — 20M, 
6 = it/2. 



III. CELL BROADBAND ENGINE 



memory directly, but the SPEs can only directly access 
their own, rather limited (256KB) local-store. This poses 
a challenge for many applications, including the NR ap- 
plication that we are considering in this article. How- 
ever, the compilers in the most recent release (3.1) of 
IBM's Cell SDK enable a software caching mechanism 
that allows for the use of the SPE local-store as a con- 
ventional cache, thus negating the need of transferring 
data manually from main memory. In essence, this mech- 
anism allows one to write SPE programs that can access 
variables stored in the PPE's address space (IBM refers 
this to as effective address space support). From our 
viewpoint, this feature takes away the main challenge 
associated to programming the CBE, and thus consid- 
erably reduces the amount of specialized code develop- 
ment needed. Therefore, we will take this approach in 
the development of a CBE optimized version of our NR 
application. Of course, it is worth pointing out that this 
approach will not allow the processor to perform at full 
potential; thus one should not expect to achieve the max- 
imal performance gain that would be possible otherwise. 

Another important mechanism that allows communi- 
cation between the the different elements (PPE, SPEs) of 
the CBE is the use of mailboxes. These are special pur- 
pose registers that can be used for uni-directional com- 
munication. Each SPE has three (3) mailboxes - two (2) 
outbound, that can hold only a single entry, and one (1) 
inbound, that can hold four (4) entries. These are typi- 
cally used for synchronizing the computation across the 
SPEs and the PPE, and that is primarily how we made 
use of these registers as well. Details on our specific use of 
these various aspects of the CBE for the NR application 
appear in the next section of this article. 



The CBE is a completely new processor, that was de- 
veloped collaboratively by Sony, IBM and Toshiba pri- 
marily for multimedia applications. This processor has a 
general purpose (PowerPC) CPU, called the PPE (that 
can execute two (2) software threads simultaneously) and 
eight (8) special-purpose compute engines, called SPEs 
only for numerical computation. Each SPE can perform 
vector operations, which implies that it can compute on 
multiple data, in a single instruction (SIMD). All of these 
compute elements are connected to each other through 
a high-speed interconnect bus (EIB). A single 3.2 GHz 
CBE (original - 2006/2007) has a peak performance of 
over 200 GFLOP /s in single-precision floating point com- 
putation and 15 GFLOP/s in double-precision. The cur- 
rent (2008) release of the CBE, called the PowerXCell, 
has design improvements that bring the double-precision 
performance up to 100 GFLOP/s. 

The main programming challenge introduced by this 
new design, is that one has to explicitly manage the 
memory transfer between the PPE and the SPEs. The 
PPE and SPEs are equipped with a DMA engine - a 
mechanism that enables data transfer to and from main 
memory and each other. Now, the PPE can access main 



IV. CBE TEUKOLSKY CODE 
IMPLEMENTATION 

As mentioned in the previous section, our approach to- 
wards running code on the Cell's SPEs involves making 
use of the effective address space support. In this manner 
we avoid the involved programming associated to explic- 
itly managing the data exchange between the PPE and 
the SPEs. Therefore, once we have decided upon the 
portion of the Teukolsky code that would be appropriate 
to run on the SPEs, the rest would be straightforward. 

Since the SPEs are the main compute engines of the 
CBE, one would want them to execute the most com- 
pute intensive tasks of a code in a data- or task- parallel 
fashion, and leave the rest (input/output, synchroniza- 
tion, etc.) to run on the PPE. We employ a data-parallel 
model, which is straightforward to implement in a code 
like ours - we simply perform a domain decomposition 
of our finite-difference numerical grid and allocate the 
different parts of the grid to different SPEs. In partic- 
ular, we parallelize along the r* coordinate dimension, 
because that typically has two orders-of-magnitude more 
grid points than the other (9) dimension. 
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Upon performing a basic profiling of our code using 
the GNU profiler gprof, we learn that the computing 
the "right-hand-sides" of the Lax-Wendroff steps i.e. the 
quantities within the square-brackets of Eqs. (|13[) and 
(fl4]) . take nearly 75% of the application's overall runtime. 
Thus, it is natural to consider accelerating this "right- 
hand-side" computation using data-parallelization on the 
SPEs. We anticipate that this observation is fairly typical 
for codes of this type. 

Now, to move this computation to the SPEs, we sim- 
ply take the corresponding PPE code and copy it into a 
skeleton SPE code and then add in the appropriate dec- 
larations for all the field arrays such that they point to 
the corresponding quantities in the PPE's address space. 
To be more explicit, lets say that we have a field array 
quantity phi defined in the PPE portion of the code as 
follows, 

double phi [M] [N] ; 

where M and N are the dimensions of the array in the 
and r* directions respectively. To point to this exact 
same array from the SPE code, we simply need to declare 
it as, 

extern ea double phi [M] [N] ; 

where the ea type qualifier is a PPE address namespace 

identifier. That entirely takes care of the complicated 
matter of transferring the necessary data back-and-forth 
between the SPEs and the PPE. Each SPE maintains a 
cache (size can be controlled at compile time using the 
-mcache-size flag) in its local-store, wherein it stores 
a small fraction of the whole array for immediate use; 
therefore its functioning is identical to the traditional 
hardware caches found in all modern processors. 

Finally, as far as making use of the SIMD capabilities 
of the SPEs are concerned, we simply enable the auto- 
vectorizing capabilities of the spu-gee compiler, as op- 
posed to performing those optimizations by hand. The 
only other issue remaining, is that of synchronizing the 
PPE and the SPEs, which is easily done by using mail- 
boxes as mentioned earlier. 



V. PERFORMANCE RESULTS 

Recall that the "right-hand-side" computation, that 
we are coding for parallelized execution on the Cell's 
SPEs takes nearly 75% of the overall runtime of the ap- 
plication. This means that at best, we can expect an 
acceleration of a factor of four (4) in the application's 
overall runtime [Amdahl's law). In this section, we com- 
pare our CBE accelerated code's performance relative to 
this maximum achievable gain. 

Fig. [2] depicts the total application runtime as a func- 
tion of the number of SPEs used, for two (2) different 
values of the software cache. One can clearly see that 
although the performance gain using this approach is 
highly dependent on the size of the software cache used, 



the gains are substantial. For the case of the 128A cache 
size, the maximum acceleration of the application is a fac- 
tor of 3.885 - this translates to over 97% of the maximum 
theoretical gain possible through our parallel approach! 

Another observation that can be made from the 128 A" 
cache data plotted in Fig. [2] is that after six (6) SPEs are 
in use, not much is gained by including additional SPEs. 
The reason for this is simply that even with only six (6) 
SPEs in use, the runtime associated to the "right-hand- 
side" computation becomes negligible as a fraction of the 
application's overall runtime. Thus, including additional 
SPEs to further accelerate that computation yields no 
benefit. If we wanted to accelerate the application even 
further, we would need to further port some of the other 
PPE-based computations onto the SPEs. 
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FIG. 2: The overall application runtime of our CBE Teukolsky 
code, as a function of the number of SPEs used, for two (2) 
different software cache sizes. The performance gains owing 
to parallelization are significant, with more gain for larger 
cache size. We also depict the runtime of the PPE-only code 
for comparison purposes. 

It is worth pointing out that these tests were performed 
on an IBM QS21 blade server that is equipped with the 
original (2006/2007) CBE - the one with the rather lim- 
ited double-precision floating-point performance. All of 
our codes were run using double-precision floating-point 
accuracy because that is the common practice in the NR 
community and also a necessity for finite-difference based 
evolutions, especially if a large number of time-steps are 
involved. On the current PowerXCell processor, the over- 
all performance of our code is likely to be much better. 



VI. CONCLUSIONS 

We demonstrate that it is possible to obtain a signif- 
icant performance gain (upto 97% of theoretical maxi- 
mum) on a typical finite-difference PDE solver by mak- 
ing use of the SPEs of the Cell processor, via a pro- 
gramming model that is relatively very straightforward 
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to implement. The approach involves the use of software 
caching through the recently made available effective ad- 
dress space support on the Cell's SPEs. 

By making use of this approach many scientific appli- 
cations may obtain high accelerations on CBE hardware 
with minimal investment in detailed and specialized cod- 
ing. Moreover, it is worth noting that we obtain these 
very strong results from the the original release of the 
Cell processor - the one found in the Sony PS3 - there- 
fore, one can obtain these performance gains using such 



extremely low-cost hardware. 
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